Class 03

DATA1220-55, Fall 2024

Sarah E. Grabinski

2024-09-04

Load Packages for Today’s Slides

# Contains the describe() function for comprehensive data summaries
library(Hmisc)
# Contains data sources used in our text book
library(openintro)
# Contains the palmer penguins dataset
library(palmerpenguins)
# For scatterplot matrices
library(GGally)
# Always load the tidyverse last
library(tidyverse)

# Set favorite ggplot2 theme for visualizations
theme_set(theme_bw())

Homework is now due Monday

  • Please take advantage of the extra time to attend office hours and/or post on Campuswire for help with remaining homework questions

  • Late policy: “This homework is due by 6:00pm on Monday, 9/9/24. No credit will be lost for assignments received by 7:00pm to account for issues with uploading. 10% of the points will be deducted from assignments received by 9:00am on Tuesday, 9/10/24. Assignments turned in after this point are only eligible for 50% credit, so it benefits you to turn in whatever you have completed by the due date.”

Let’s talk about coding anxiety

  • It’s natural to be anxious about learning to code, but coding has an undeservedly bad reputation

  • Older coding languages were less “readable” and required a lot of memorization

  • Modern languages are more interpretable (i.e. functions are named for what they do)

  • Someone has probably answered the question you have somewhere on the internet

What are your thoughts on ChatGPT?

It’s a lying liar that lies. It will make up functions that don’t exist, and troubleshooting its bugs has wasted countless of my hours. Using ChatGPT to write new code is risky at best. I do not recommend it.

That said, I have found ChatGPT occasionally useful for debugging code when I don’t understand the error that is being generated. It is also useful for generating custom markdown templates. That has still led to dead ends and lost time though, so use at your own risk.

My Approach to Coding

Don’t waste time memorizing functions, package names, parameters, etc. Everything is just a quick Google search away. You should honestly be able to copy-paste a lot of your homework code from somewhere in the slides on that chapter. What you should focus on learning is how to recognize different types of data and which tools are best for analyzing that data. You won’t need every tool in your toolkit all the time, but you should know how to find them when you need them.

What is data science?

Data Science Pipeline

Source: Figure 1.1 in https://r4ds.hadley.nz/intro.html

Chatfield’s Six Rules for Data Analysis

  1. Do not attempt to analyze the data until you understand what is being measured and why.
  2. Find out how the data were collected.
  3. Look at the structure of the data.
  4. Carefully examine the data in an exploratory way, before attempting a more sophisticated analysis.
  5. Use your common sense at all times.
  6. Report the results in a clear, self-explanatory way.

Chatfield, Chris (1996) Problem Solving: A Statistician’s Guide, 2nd ed.

Chapter 1 Pipeline

Red boxes around the "Import", "Visualize" and "Communicate" components of the data science pipeline

Data science pipeline priorities for Chapter 1

Chapter 1 Objectives

  • Get to know you better

  • Set up R, RStudio, and Campuswire

  • Describe how data was collected

    • Study, sample, and target populations

    • Sampling procedures and principles

  • Identify what types of variables were measured

  • Import and summarize raw data

  • Create an exploratory visualization

  • Communicate findings using a Quarto markdown document

What we’ll tackle today…

  • Get to know you better

  • Set up R, RStudio, and Campuswire

  • Identify what types of variables were measured

  • Import and summarize raw data

  • Create an exploratory visualization

  • Communicate findings using a Quarto markdown document

What we’ll tackle on Friday…

  • Describe how data was collected

    • Study, sample, and target populations

    • Sampling procedures and principles

  • Communicate findings using a Quarto markdown document (cont.)

Introductory Survey

  • DATA1220-55 Fall 2024 Intro Survey (link)

  • “Getting to know you” exercise to help me serve you better

  • Should take fewer than 10 minutes to complete

  • 21 people have already responded – THANK YOU!

  • Worth FIVE FREE POINTS on Homework 1

Campuswire Forum

  • Class Feed (link, bookmark this page!)

  • Forum for homework issues, discussions, earning participation credit

  • Point-based system for asking questions, crowdsourcing answers

  • 22 people have completed registration – THANK YOU!

  • Worth FIVE FREE POINTS on Homework 1

Making A New Post

Make a new post on Campuswire by selecting new post, labeling it as a question, giving it a topic, and posting to everyone

Post a question on Campuswire by selecting the blue “+ New post” button, creating a new **question** using the drop-down menu, tagging the question topic, giving it a title and description, and posting. Be sure to post a new **question** and not a note (default option) for full credit!

Example Question

A student post asking for help on Homework 1 in the class feed on Campuswire

A magnificent example of a brave student using Campuswire to crowdsource help on their homework

Interacting With Questions

A question post on Campuswire with a blue thumbs-up icon ("Like"), blue "Answer this question" button, and blue up-facing arrow with the number 1 circled in red, showing ways to interact with question posts

Interact with student-posed questions and discussion posts by liking the post, answering the question, or up-voting the answer(s) you think are best.

Participation Points on Campuswire

A student answer to a question with their participation score and level icon circled in red.

Participation scores will appear directly to the right of names with a bird icon indicating their level/status. Click on the icon to pull up how to earn participation points, the interactions needed to reach different ranks/levels, and your current participation status.

Earning Participation Points

Students will receive +2 points for each like their questions receive, +2 points for asking a question on the feed, +5 points for answering a question on the feed, and +10 points for having their answer upvoted by another student

The number of participation points (called “reputation points” on Campuswire) received for each type of interaction on the class feed.

Participation Levels

A screenshot of the reputation levels showing levels 0-2, their corresponding icons, the interactions needed to reach each level, and your current progress.

The first 3 participation levels, their corresponding icons, the interactions needed to reach each level, and your current progress.

What’s the difference between R and RStudio?

  • R is an open source statistical programming language.

  • R comes with its own user interface called RGui, but its functionality is limited.

We do NOT want to use RGui.

What’s the difference between R and RStudio?

RStudio is an integrated development environment (IDE) with a variety of tools for working with coding languages like R and Python.

  • Smart code-highlighting for easy reading

  • Direct and “chunkable” code execution

  • Visualization capabilities

  • Environment, workspace, and file management

Downloading R v4.4.1

This must be done before you can use RStudio.

Installing RStudio Desktop

You may have to manually add an RStudio shortcut to your desktop.

How can I tell the difference?

A green smiley face over the RStudio desktop icon, which we want to be using, and a red cross-out symbol over the R 4.4.1 desktop, which we do not want to use.

We want to work in RStudio.

RStudio Files

We will work with 4 types of file in RStudio:

  1. R scripts, ending in .R: text files containing only R code with no output

  2. Quarto markdown documents, ending in .qmd: rich text files that combine R code with markdown language and YAML headers to format the document

  3. HTML files, ending in .html: the rendered output of a Quarto markdown document that can be viewed in any standard web browser

  4. PDF files, ending in .pdf: the rendered output of a Quarto markdown document that can be viewed in any standard PDF viewer

Creating Raw Files in RStudio

Create raw text files by going to “File > New File” in RStudio and selecting…

  • R Script for .R files

  • Quarto Document for .qmd article-like documents

  • Quarto Presentation for .qmd powerpoint-like presentations

What does a raw .qmd file look like?

You can access the raw text, including the markdown language, by using the “Source” editor option.

Screenshot of the "Source" editor in RStudio with the option circled in red in the editor pane toolbar. The document type, "PDF" is also indicated in red in the YAML code.

The “Source” editor shows the raw code and markup language without any preprocessing by RStudio. The file type is set as “PDF” in the YAML header at the top.

Visual Editor for Raw Files

The "visual" editor option is selected in the toolbar for the editor pane and the editor is set as "visual" in the YAML code.

How to toggle between the “Source” and “Visual” editors in RStudio using the editor pane toolbar, along with how to set the default editor to “Visual” in the YAML header. The document type (“HTML”) is also specified in the YAML header.

We want to live in the visual editor for now!

Advantages of the Visual Editor

  • Format text, insert links/images, create tables, and more from the toolbar (no coding necessary!)

  • Preview document formatting without waiting for it to render over and over

  • Insert executable cells (“code chunks”) and other special features directly into the document

Creating Rendered Files in RStudio

Create rendered output files by (1) defining the document type in the YAML header and (2) running the “Render” process with the blue arrow icon at the top of the editor pane.

The "Render" process and its icon in the editor pane of RStudio circled in red.

Run the “Render” process with the blue arrow icon to generate HTML and PDF files from Quarto markdown documents.

WTH is a YAML?

  • A text header bound by 3 dashes (---) at the top and bottom of a .qmd file

  • Composed of key-value pairs, using the syntax key: value to define parameters

  • Defines the document type, formatting, default options, and other metadata for your project

I will write most of these for you in your homework templates.

Example YAML Header from Homework 1

Screenshot of the YAML header from the homework1_template.qmd file in the Chapter 1 module on Canvas.

The YAML header from the homework1_template.qmd file in the Chapter 1 module on Canvas. It establishes the document metadata (title, author, date, etc.), defines the document type (HTML), and sets the default editor to “Visual”.

Using Projects in RStudio

“Projects” in RStudio, files ending in .Rproj, allow you to divide your work into discrete containers for each project, keeping them separated with their own unique…

  • Folder containing associated files (called the “working directory”)

  • Data environment storing imported data, variables, and calculated results (.RData files)

  • Work history containing executed code (.Rhistory files)

And the best parts?

  • Projects will autosave your open documents so you have less to recover in the event of a crash

  • Projects will store data and results objects in their “Global Environment” between sessions, so you don’t have to run the same calculations over and over each time

  • Projects allow you to work on 2+ projects simultaneously across multiple RStudio sessions
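Because a project has its own working directory, file paths in your code can be written relative to that folder. A minimal sketch of checking where R is looking for files; the folder name "data" and the file name "my_data.csv" are made up for illustration and are not files from this course:

```r
# Where is R currently looking for files?
# Inside a project, this is the project's folder.
project_dir <- getwd()
project_dir

# Build a path relative to the working directory.
# "data" and "my_data.csv" are hypothetical names.
csv_path <- file.path("data", "my_data.csv")
csv_path

# Check whether the file exists before trying to import it
file.exists(csv_path)
```

Relative paths like this keep working when the whole project folder is moved or shared, which is one reason to keep each project's files together.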

Creating a Project in RStudio

Screenshot of the pop-up prompt when creating a new project in RStudio, with options to start project in a new, existing, or version control directory.

The pop-up prompt when creating a new project in RStudio. We will use new or existing directories.

Installing Code Packages

  • Packages are collections of custom functions to use for statistical analyses

  • Some are installed with base R, and some need to be installed manually

    • Install and update using the “Packages” tab in RStudio

    • Install in the console using the install.packages('package_name') function

  • Packages are loaded into documents before any other code is written, using the library('package_name') function
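One way to avoid reinstalling packages you already have is to check what is missing first. This is only a sketch: the helper name missing_packages() is my own invention (not part of R), and the package list is copied from the "Load Packages" slide.

```r
# Packages used in today's slides
needed <- c("Hmisc", "openintro", "palmerpenguins", "GGally", "tidyverse")

# Hypothetical helper: returns the packages in `pkgs`
# that are not installed on this computer yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Which of the packages above still need to be installed?
to_install <- missing_packages(needed)
to_install

# Install only what is missing. Run this once in the console,
# not inside your .qmd file:
# install.packages(to_install)
```

Installing happens once per computer, while library() has to be run in every new session or document that uses the package.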

Coding Notes - Assignment Operator

You can declare a variable in R in one of two ways: using = or using the assignment operator <-. I prefer the latter, and that’s what you’ll see most often in my code. Both are acceptable.

# Example: defining a variable with =
x = 5
# Display the variable by declaring it
x
[1] 5
# Example: defining a variable with <-
y <- 27
# Display the variable with the print() function
print(y)
[1] 27

Coding Notes - Pipe Operator

Create long chains of functions using the pipe operator |> to pass the results of one function as new input for the next function. We won’t do much with this until you begin writing more complex code.

# Load the palmer penguins dataset and pipe on
penguins |>
  # Calculate a new variable and pipe on
  mutate(bill_length_cm = bill_length_mm / 10) |>
  # Rename a variable and pipe on
  rename(gender = sex) |>
  # Display specific variables from final result
  select(species, bill_length_cm, gender)

Coding Notes - Pipe Operator

# A tibble: 344 × 3
   species bill_length_cm gender
   <fct>            <dbl> <fct> 
 1 Adelie            3.91 male  
 2 Adelie            3.95 female
 3 Adelie            4.03 female
 4 Adelie           NA    <NA>  
 5 Adelie            3.67 female
 6 Adelie            3.93 male  
 7 Adelie            3.89 female
 8 Adelie            3.92 male  
 9 Adelie            3.41 <NA>  
10 Adelie            4.2  <NA>  
# ℹ 334 more rows
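To see what the pipe is doing, it can help to compare a piped chain with the same calculation written as nested functions. A small base R sketch (the numbers are arbitrary; the |> operator needs R 4.1 or newer):

```r
# Nested version: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped version: read from top to bottom
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()

nested                    # 6
piped                     # 6
identical(nested, piped)  # TRUE
```

Both versions take the square roots (1, 2, 3) and add them up. The piped version just lists the steps in the order they happen.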

Common Raw Data Formats

  • Preformatted data sets in packages (e.g. openintro package from our textbook)

  • R data files: .rds for a single data object, .rda or .RData for one or more data objects

  • Delimited files: text files with character separators

    • Comma separated values (.csv)

    • Tab-separated values (.tsv)

  • Excel spreadsheets (e.g. .xls, .xlsx)

How do I get data files into R?

I will demonstrate the code for how to import common raw data formats as we work through examples in the course, rather than show them to you all at once.

  • See Class 02, slides 17-19 for an example loading the palmerpenguins package data

  • The Homework 1 template includes code to import a comma-separated values file (.csv)
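As a preview, here is a rough sketch of importing a .csv file and saving an .rds file using base R functions. The file contents are made up, and the file is written to a temporary location so the example is self-contained. The Homework 1 template may use a different function (for example, read_csv() from the readr package).

```r
# Write a tiny example file so this sketch is self-contained.
# The contents are made up for illustration.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,month,state",
             "1,June,OH",
             "2,July,IN",
             "3,August,CA"),
           csv_file)

# Import the comma-separated file into a dataframe
my_data <- read.csv(csv_file)
my_data

# Save one object to an .rds file and read it back in
rds_file <- tempfile(fileext = ".rds")
saveRDS(my_data, rds_file)
my_data_again <- readRDS(rds_file)

# The re-imported object matches the original
all.equal(my_data, my_data_again)
```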

Once I import data, where does it go?

Imported data, user-defined variables, and calculated results will be stored in your project’s “Global Environment”.

What does data look like: Lists

# vector: c() combines values of the same type
c('object1', 'object2', 'object3')
[1] "object1" "object2" "object3"
# list: list() can hold values of any type
list(1, 2, 3, 4, 5)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 5
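The practical difference between c() and list() shows up when the values are not all the same type. A short sketch with made-up values:

```r
# c() builds a vector: mixing types forces everything to one type
mixed_vector <- c(1, "two", 3)
class(mixed_vector)         # "character"

# list() keeps each element's own type
mixed_list <- list(1, "two", 3)
sapply(mixed_list, class)   # "numeric" "character" "numeric"
```

In the vector, the numbers 1 and 3 were silently converted to text, so you could no longer do math with them.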

What does data look like: Named lists

# named list
list(name = 'Sabrina', age = 25, 
     major = 'Data Science', grad_year = 2026)
$name
[1] "Sabrina"

$age
[1] 25

$major
[1] "Data Science"

$grad_year
[1] 2026
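Naming the elements makes them easy to pull back out. A sketch that reuses the named list from this slide (the object name student is my own choice):

```r
# Store the named list in a variable
student <- list(name = 'Sabrina', age = 25,
                major = 'Data Science', grad_year = 2026)

# Pull out one element by name with $
student$major       # "Data Science"

# Double brackets do the same thing with the name as a string
student[["age"]]    # 25

# names() lists everything stored in the list
names(student)

# Elements keep their type, so you can do math with numbers
student$age + 1     # 26
```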

What does data look like: Matrices

# Input = a list of items
# Parameters = # of rows and/or columns
# Initialize cells by row = TRUE
matrix(data = seq(1:25), nrow = 5, 
       ncol = 5, byrow = T)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
[5,]   21   22   23   24   25
# Initialize cells by column with byrow = FALSE
matrix(data = seq(1:25), nrow = 5, 
       ncol = 5, byrow = F)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
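Cells in a matrix are addressed as [row, column]. A sketch using the first matrix from this slide, the one filled by row (the object name m is my own choice):

```r
# The 5 x 5 matrix filled by row
m <- matrix(data = 1:25, nrow = 5, ncol = 5, byrow = TRUE)

dim(m)     # 5 5  (rows, then columns)
m[2, 3]    # 8    (row 2, column 3)
m[2, ]     # 6 7 8 9 10     (all of row 2)
m[, 3]     # 3 8 13 18 23   (all of column 3)
```

Leaving the row or column position blank means "give me everything" along that dimension.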

What does data look like: Dataframes

# seq creates a list of numbers from low:high
# as.character changes variable type from numeric
data.frame(id = as.character(seq(1:5)), 
           month = c('June', 'June', 'June', 
                     'July', 'August'), 
           # in rep, the 2nd parameter is the number of 
           # times the 1st parameter is repeated in a list
           year = rep(2024, 5), 
           state = c('OH', 'OH', 'IN', 'IN', 'CA'), 
           # converts string variables to factors
           stringsAsFactors = T)
  id  month year state
1  1   June 2024    OH
2  2   June 2024    OH
3  3   June 2024    IN
4  4   July 2024    IN
5  5 August 2024    CA

What is a dataframe?

  • Data arranged in rows and columns like a typical spreadsheet

  • Each row (ideally) contains 1 unique observation of the data for each of the measured variables

  • Each column (ideally) contains all the observations for 1 unique variable that was measured
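Here is a sketch of checking the rows and columns of a dataframe, using the small example from the previous slide. The object name df_example is my own choice.

```r
# Rebuild the example dataframe and store it in a variable
df_example <- data.frame(id = as.character(1:5),
                         month = c('June', 'June', 'June',
                                   'July', 'August'),
                         year = rep(2024, 5),
                         state = c('OH', 'OH', 'IN', 'IN', 'CA'),
                         stringsAsFactors = TRUE)

nrow(df_example)    # 5 observations (rows)
ncol(df_example)    # 4 variables (columns)

# One column = all observations of one variable
df_example$state

# One row = one observation across all variables
df_example[3, ]

# The data type stored in each column
sapply(df_example, class)
```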

What does a dataframe look like?

Screenshot of the information available in the environment about dataframes, including the name, number of rows and columns, variable names, and data types.

The compact view shows the name (dataframe), the number of rows (obs.), and the number of columns (variables). A preview of the variable names and data types can be accessed with the blue drop-down.

What’s a data dictionary?

A formatted document, often a table, which provides information about the variables such as…

  • The variable names as seen in the raw data

  • Description of the variable measured

  • Units in which the variable was measured

  • Number of observations

  • Number of missing values
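Parts of a data dictionary can be generated from the data itself. The sketch below builds the variable names, types, and observation/missing counts for airquality, a small dataset that ships with R (my choice for illustration; it is not a dataset from this course). Descriptions and units still have to be written by hand from the data documentation.

```r
# Start a data dictionary from the data itself
data_dictionary <- data.frame(
  variable  = names(airquality),
  type      = sapply(airquality, class),
  # number of non-missing observations per variable
  n_obs     = sapply(airquality, function(col) sum(!is.na(col))),
  # number of missing values (NA) per variable
  n_missing = sapply(airquality, function(col) sum(is.na(col))),
  row.names = NULL
)

data_dictionary
```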

Types of Data

Flow chart with all variables being split into numerical and categorical variables. Numerical is further subdivided into continuous and discrete. Categorical is subdivided into nominal or ordinal.

The two primary types of data we’ll analyze are numerical and categorical variables.

Numerical vs Categorical Data

  • Numerical data is quantitative (measured) data.

  • Categorical data is qualitative (descriptive) data.

    • Binary categorical variables only have 2 categories (e.g. 1 or 0)

    • Multi-categorical variables have 3+ categories (e.g.

Numerical Variables: Continuous vs Discrete

  • Continuous numeric variables can take any value imaginable within a given range

    • Examples: degrees Celsius, weight in grams, time elapsed in milliseconds
  • Discrete numeric variables have a limited set of potential values

    • Examples: counts, time in months

Categorical Variables: Nominal vs Ordinal

  • Nominal categorical variables have no order

    • Rearranging the categories makes no difference
  • Ordinal categorical variables have a direction

    • The order of the categories has meaning

    • Example: On a scale from 1-5…, Strongly Agree to Strongly Disagree
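In R, categorical variables are usually stored as factors, and an ordinal variable is a factor whose levels have been given an order. A sketch with made-up survey responses:

```r
# Nominal: categories with no built-in order
state <- factor(c('OH', 'OH', 'IN', 'IN', 'CA'))
is.ordered(state)        # FALSE

# Ordinal: list the levels from lowest to highest and set ordered = TRUE
agreement_levels <- c('Strongly Disagree', 'Disagree', 'Neutral',
                      'Agree', 'Strongly Agree')
responses <- factor(c('Agree', 'Neutral', 'Strongly Agree', 'Disagree'),
                    levels = agreement_levels,
                    ordered = TRUE)
is.ordered(responses)    # TRUE

# Because the levels are ordered, comparisons make sense
responses[1] > responses[2]   # TRUE: Agree ranks above Neutral

# Counts are reported in level order, not alphabetical order
table(responses)
```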

How can I tell what kind of variable I have?

  • Inspect the data in the global environment

  • Print the data to the console or in a code chunk

  • Use the glimpse() function for a quick summary

  • Use the describe() function from the Hmisc package for a detailed summary

Inspect the data in the global environment

Screenshot of a dataframe in the global environment with variable types underlined in red.

Inspecting the object in the environment can provide details about what data types you have.

Use glimpse() for a quick summary

glimpse(penguins_raw)
Rows: 344
Columns: 17
$ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
$ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
$ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
$ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
$ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
$ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
$ `Date Egg`            <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16,…
$ `Culmen Length (mm)`  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34…
$ `Culmen Depth (mm)`   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18…
$ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,…
$ `Body Mass (g)`       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34…
$ Sex                   <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…
$ `Delta 15 N (o/oo)`   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18…
$ `Delta 13 C (o/oo)`   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.298…
$ Comments              <chr> "Not enough blood for isotopes.", NA, NA, "Adult…

Use Hmisc::describe() for a detailed summary

# describe is a common function name, so 
# it is a good habit to call this version
# directly from the package using package_name::
# to prevent conflicts and errors
Hmisc::describe(penguins)
penguins 

 8  Variables      344  Observations
--------------------------------------------------------------------------------
species 
       n  missing distinct 
     344        0        3 
                                        
Value         Adelie Chinstrap    Gentoo
Frequency        152        68       124
Proportion     0.442     0.198     0.360
--------------------------------------------------------------------------------
island 
       n  missing distinct 
     344        0        3 
                                        
Value         Biscoe     Dream Torgersen
Frequency        168       124        52
Proportion     0.488     0.360     0.151
--------------------------------------------------------------------------------
bill_length_mm 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2      164        1    43.92    6.274    35.70    36.60 
     .25      .50      .75      .90      .95 
   39.23    44.45    48.50    50.80    51.99 

lowest : 32.1 33.1 33.5 34   34.1, highest: 55.1 55.8 55.9 58   59.6
--------------------------------------------------------------------------------
bill_depth_mm 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2       80        1    17.15    2.267     13.9     14.3 
     .25      .50      .75      .90      .95 
    15.6     17.3     18.7     19.5     20.0 

lowest : 13.1 13.2 13.3 13.4 13.5, highest: 20.7 20.8 21.1 21.2 21.5
--------------------------------------------------------------------------------
flipper_length_mm 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2       55    0.999    200.9    16.03    181.0    185.0 
     .25      .50      .75      .90      .95 
   190.0    197.0    213.0    220.9    225.0 

lowest : 172 174 176 178 179, highest: 226 228 229 230 231
--------------------------------------------------------------------------------
body_mass_g 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2       94        1     4202    911.8     3150     3300 
     .25      .50      .75      .90      .95 
    3550     4050     4750     5400     5650 

lowest : 2700 2850 2900 2925 2975, highest: 5850 5950 6000 6050 6300
--------------------------------------------------------------------------------
sex 
       n  missing distinct 
     333       11        2 
                        
Value      female   male
Frequency     165    168
Proportion  0.495  0.505
--------------------------------------------------------------------------------
year 
       n  missing distinct     Info     Mean      Gmd 
     344        0        3    0.888     2008   0.8919 
                            
Value       2007  2008  2009
Frequency    110   114   120
Proportion 0.320 0.331 0.349

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------

Variable Terms

  • Independent or explanatory variable

    • Typically on the x-axis

    • “Cause” variable

  • Dependent or response variable

    • Typically on the y-axis

    • “Effect” variable

Visualizing Data

  • Distributions and numerical summaries of both explanatory and response variables

    • Histogram, bar plot

    • Density or violin plots

    • Boxplots

  • Associations, relationships, correlations between explanatory and response variables

    • Scatter plots, regression

    • Scatterplot matrices

Histogram - How common are certain ranges of values? (Discrete)

# Pipe data into ggplot2
penguins |>   
  # Initialize the plot parameters with aes
  ggplot(aes(x = bill_length_mm)) + # ggplot2 only uses +! no pipes!
  # add a histogram to the plot
  geom_histogram(binwidth = 2, # each bin spans 2 mm                 
                 fill = 'steelblue', # some color for fun 
                 color = 'white') +  
  # Add titles and axis labels
  labs(title = 'Distribution of Bill Lengths',        
       subtitle = 'In Adelie, Chinstrap, and Gentoo Penguins',        
       x = 'Bill Length (mm)',        
       y = '# of Individuals')
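Behind the scenes, a histogram sorts each value into a bin and counts how many land in each one. The sketch below does that by hand in base R for the first 8 bill lengths in the data (copied from the glimpse() output), using bins that are 2 mm wide. The exact bin edges that ggplot2 picks may differ from the ones I chose here.

```r
# First 8 bill lengths (mm), including one missing value
bill_length_mm <- c(39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2)

# Sort each value into a 2 mm wide bin: [36,38), [38,40), [40,42)
bins <- cut(bill_length_mm,
            breaks = seq(36, 42, by = 2),
            right = FALSE)

# Count the values in each bin (the missing value is dropped)
bin_counts <- table(bins)
bin_counts
```

The counts (1, 5, and 1) are the bar heights a histogram would draw for these 7 non-missing values.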

Density Plot - How common are certain ranges of values? (Continuous)

# Pipe data into ggplot2
penguins |>   
  # Initialize the plot parameters with aes
  ggplot(aes(x = bill_length_mm)) +   
  # add a density curve to the plot
  geom_density(fill = 'steelblue', # add some color and make it transparent
               alpha = 0.5) +   
  # Add titles and axis labels
  labs(title = 'Distribution of Bill Lengths',        
       subtitle = 'In Adelie, Chinstrap, and Gentoo Penguins',        
       x = 'Bill Length (mm)',        
       y = 'Density')
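The y-axis of a density curve is a density rather than a count: the curve is scaled so that the total area under it is about 1. A rough base R check using the first 7 non-missing bill lengths from the data (copied from the glimpse() output):

```r
# First 7 non-missing bill lengths (mm)
bill_length_mm <- c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2)

# Estimate the density curve (x = grid of bill lengths, y = density)
d <- density(bill_length_mm)

# Approximate the area under the curve:
# add up the heights and multiply by the spacing between grid points
area <- sum(d$y) * diff(d$x[1:2])
round(area, 2)
```

Since the area is fixed, a taller peak means the values are more concentrated in that range, not that there are more individuals in the dataset.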

Boxplot + Violin - Numerical summary + density curves

# Pipe data into ggplot2
penguins |>   
  # Initialize the plot parameters with aes
  ggplot(aes(x = species, y = bill_length_mm)) +   
  # Add a violin plot as the base layer
  geom_violin(aes(fill = species)) +   
  # Add a boxplot on top of the violin plot
  geom_boxplot(width = 0.3)
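The box in a boxplot is drawn from three numbers: the 25th percentile, the median, and the 75th percentile. A sketch of computing them in base R for the first 8 bill lengths in the data (copied from the glimpse() output). This is roughly what geom_boxplot() calculates for each species, under the usual rule that points more than 1.5 box-lengths away from the box are drawn as outliers.

```r
# First 8 bill lengths (mm), including one missing value
bill_length_mm <- c(39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2)

# Bottom of the box, middle line, and top of the box
box_stats <- quantile(bill_length_mm,
                      probs = c(0.25, 0.5, 0.75),
                      na.rm = TRUE)
box_stats    # 39.0, 39.2, 39.4

# Interquartile range = height of the box
box_height <- IQR(bill_length_mm, na.rm = TRUE)

# Values more than 1.5 box-heights beyond the box are flagged as outliers
lower_fence <- box_stats[["25%"]] - 1.5 * box_height
upper_fence <- box_stats[["75%"]] + 1.5 * box_height
outliers <- bill_length_mm[which(bill_length_mm < lower_fence |
                                 bill_length_mm > upper_fence)]
outliers
```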

Association vs. Independence

  • When two variables show some connection with one another, they are called associated variables.

  • If two variables are not associated, i.e. there is no evident connection between the two, then they are said to be independent.

Scatterplot Matrices - Quick Look at Many Relationships

This code will sometimes run slowly and generate lots of warning messages.

penguins |>
  # Select variables of interest
  select(species, island, bill_length_mm,
         bill_depth_mm) |>
  # send to ggpairs to create the matrix
  ggpairs()

Scatter plot + Linear Regression - Detailed Look at 1 Relationship

# Pipe data into ggplot2
penguins |>
  # Set x and y variables with aes
  ggplot(aes(x = bill_depth_mm,
             y = bill_length_mm)) +
  # add a scatterplot
  geom_point() +
  # add a linear model regression line
  geom_smooth(formula = y ~ x, 
              # set method to lm
              method = 'lm', 
              # keep standard error shading
              se = T)

Scatter plot + Linear Regression - Detailed Look at 1 Relationship

Does it look like this regression line accurately describes the relationship between bill depth and bill length? Do you see any patterns in the points?

Negatively Correlated Variables

The regression line slopes downwards from the upper left-hand corner towards the lower right-hand corner.

  • Bill length is negatively associated with bill depth for all penguins sampled.

  • Bill length is negatively correlated with bill depth.

  • As bill depth increases, bill length decreases.
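The direction of a relationship can also be checked with numbers: the correlation coefficient from cor() and the slope of the regression line from lm() share the same sign. A sketch with made-up numbers (these are not penguin measurements):

```r
# Made-up numbers for illustration only
x      <- 1:8
y_down <- c(20, 18.5, 17.2, 15.1, 13.8, 11.9, 10.4, 8.6)   # falls as x rises
y_up   <- c(3.1, 4.8, 7.2, 8.9, 11.3, 12.7, 15.2, 16.8)    # rises as x rises

# Correlation: negative when y falls as x rises, positive when y rises
cor(x, y_down)
cor(x, y_up)

# The slope of the linear regression line has the same sign
slope_down <- coef(lm(y_down ~ x))[["x"]]
slope_down
```

With real data that contains missing values, such as the penguins, cor() returns NA unless you tell it to skip incomplete rows with the argument use = "complete.obs".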

Scatter plot + LOESS Regression - Detailed Look at 1 Relationship

Using the localized regression technique LOESS can help you identify trends in your data that traditional linear models can miss. What do you see here?

# Pipe data into ggplot2
penguins |>
  # Set x and y variables with aes
  ggplot(aes(x = bill_depth_mm,
             y = bill_length_mm)) +
  # add a scatterplot
  geom_point() +
  # add a LOESS regression line
  geom_smooth(formula = y ~ x, 
              # set method to loess
              method = 'loess', 
              # change the color
              color = 'red',
              # remove standard error shading
              se = F)

What happens when we consider additional variables in the data?

How does this plot differ from the first linear regression analysis on data from penguins not grouped by species?

What happens when we consider additional variables in the data?

# Pipe data into ggplot2
penguins |>
  # Set x and y variables with aes
  ggplot(aes(x = bill_depth_mm,
             y = bill_length_mm, 
             # group the data by species
             group = species, 
             # color the points/lines by species
             color = species)) +
  # add a scatterplot
  geom_point(alpha = 0.5) +
  # add a linear model regression line
  geom_smooth(formula = y ~ x, 
              # set method to lm
              method = 'lm', 
              # keep standard error shading
              se = T)

Positively Correlated Variables

The regression lines slope upwards from the bottom left-hand corner towards the upper right-hand corner.

  • Bill length is positively associated with bill depth within each of the 3 penguin species.

  • Bill length is positively correlated with bill depth within each species.

  • Within a species, as bill depth increases, bill length increases.
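Grouping can flip the direction of a relationship. The sketch below uses a small made-up dataset (two invented groups, "A" and "B", not real penguin measurements) where the overall correlation is negative even though the correlation inside each group is positive.

```r
# Made-up data: group B has shallower but longer bills than group A
toy <- data.frame(
  group  = rep(c("A", "B"), each = 4),
  depth  = c(17, 18, 19, 20, 13, 14, 15, 16),
  length = c(35.0, 36.5, 36.8, 38.2, 45.0, 46.1, 47.3, 47.9)
)

# Ignoring the groups: the correlation is negative
pooled <- cor(toy$depth, toy$length)
pooled

# Within each group: the correlation is positive
by_group <- sapply(split(toy, toy$group),
                   function(d) cor(d$depth, d$length))
by_group
```

The overall trend mostly reflects the difference between the groups, which is why the grouping variable has to be considered before drawing conclusions.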

Take-Home Lessons

  • Conclusions are shaped by the assumptions we make during the analysis

  • Context is important!

  • A picture is worth a thousand words

Next time…

  • Please complete your Google surveys

  • Please register for Campuswire

  • Where does your data come from? (defining populations)

  • Principles of sampling (Skittles activity??)

  • DATA1220 pre-survey (FREE 2.5% of final grade)

    • Please contact me if you will not be in class Friday

Session Info

At the end of every project, you should include your session info. This function prints out your computer’s operating system, the version of R you are using, and the packages loaded in your session (along with their dependencies) plus version numbers. This is a good habit for producing reproducible research.

xfun::session_info()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  airports_0.1.0       askpass_1.2.0        backports_1.5.0     
  base64enc_0.1-3      bit_4.0.5            bit64_4.0.5         
  blob_1.2.4           broom_1.0.6          broom.helpers_1.17.0
  bslib_0.8.0          cachem_1.1.0         callr_3.7.6         
  cards_0.2.2          cellranger_1.1.0     checkmate_2.3.2     
  cherryblossom_0.1.0  cli_3.6.3            clipr_0.8.0         
  cluster_2.1.6        colorspace_2.1-1     compiler_4.4.1      
  conflicted_1.2.0     cpp11_0.5.0          crayon_1.5.3        
  curl_5.2.2           data.table_1.16.0    DBI_1.2.3           
  dbplyr_2.5.0         digest_0.6.37        dplyr_1.1.4         
  dtplyr_1.3.1         evaluate_0.24.0      fansi_1.0.6         
  farver_2.1.2         fastmap_1.2.0        fontawesome_0.5.2   
  forcats_1.0.0        foreign_0.8-87       Formula_1.2-5       
  fs_1.6.4             gargle_1.5.2         generics_0.1.3      
  GGally_2.2.1         ggplot2_3.5.1        ggstats_0.6.0       
  glue_1.7.0           googledrive_2.1.1    googlesheets4_1.1.1 
  graphics_4.4.1       grDevices_4.4.1      grid_4.4.1          
  gridExtra_2.3        gtable_0.3.5         haven_2.5.4         
  highr_0.11           Hmisc_5.1-3          hms_1.1.3           
  htmlTable_2.4.3      htmltools_0.5.8.1    htmlwidgets_1.6.4   
  httr_1.4.7           ids_1.0.1            isoband_0.2.7       
  jquerylib_0.1.4      jsonlite_1.8.8       knitr_1.48          
  labeling_0.4.3       labelled_2.13.0      lattice_0.22-6      
  lifecycle_1.0.4      lubridate_1.9.3      magrittr_2.0.3      
  MASS_7.3.61          Matrix_1.7-0         memoise_2.0.1       
  methods_4.4.1        mgcv_1.9-1           mime_0.12           
  modelr_0.1.11        munsell_0.5.1        nlme_3.1-166        
  nnet_7.3-19          openintro_2.5.0      openssl_2.2.1       
  palmerpenguins_0.1.1 patchwork_1.2.0      pillar_1.9.0        
  pkgconfig_2.0.3      plyr_1.8.9           prettyunits_1.2.0   
  processx_3.8.4       progress_1.2.3       ps_1.7.7            
  purrr_1.0.2          R6_2.5.1             ragg_1.3.2          
  rappdirs_0.3.3       RColorBrewer_1.1-3   Rcpp_1.0.13         
  readr_2.1.5          readxl_1.4.3         rematch_2.0.0       
  rematch2_2.1.2       reprex_2.1.1         rlang_1.1.4         
  rmarkdown_2.28       rpart_4.1.23         rstudioapi_0.16.0   
  rvest_1.0.4          sass_0.4.9           scales_1.3.0        
  selectr_0.4.2        splines_4.4.1        stats_4.4.1         
  stringi_1.8.4        stringr_1.5.1        sys_3.4.2           
  systemfonts_1.1.0    textshaping_0.4.0    tibble_3.2.1        
  tidyr_1.3.1          tidyselect_1.2.1     tidyverse_2.0.0     
  timechange_0.3.0     tinytex_0.52         tools_4.4.1         
  tzdb_0.4.0           usdata_0.3.1         utf8_1.2.4          
  utils_4.4.1          uuid_1.2.1           vctrs_0.6.5         
  viridis_0.6.5        viridisLite_0.4.2    vroom_1.6.5         
  withr_3.0.1          xfun_0.47            xml2_1.3.6          
  yaml_2.3.10
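Base R has a similar function, sessionInfo(), which needs no extra packages. A minimal sketch (the printed details will differ from computer to computer):

```r
# Collect details about the current R session
info <- sessionInfo()

# Print the operating system, R version, and loaded packages
info

# Just the R version as a single line of text
R.version.string
```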